Exploring a Continuous and Flexible Representation of the Lexicon

Authors

  • Pierre Marchal
  • Thierry Poibeau
Abstract

We aim to show that lexical descriptions based on multifactorial and continuous models can be used by linguists and lexicographers (and not only by machines), as long as they are provided with a way to efficiently navigate data collections. We propose to demonstrate such a system.

1 Background and Motivations

"You shall know a word by the company it keeps!" (Firth, 1957). This well-known quotation motivates any lexicographic work today: it is widely accepted that word description cannot be achieved without the analysis of a large number of contexts extracted from real corpora. However, this is not enough. The recent success of deep learning approaches has shown that discrete representations of the lexicon are no longer appropriate. Continuous models offer a better representation of word meaning, because they encode intuitively valid and cognitively plausible principles: semantic similarity is relative, context-sensitive, and depends on multiple-cue integration.

At this point, one may say that it does not matter if these models are too abstract and too complex for humans, since they are used by machines. We think this argument is wrong. If continuous models offer a better representation of the lexicon, we must conceive new lexical databases that are usable by humans and share the same basis as these continuous models. There are arguments to support this view. For example, it has been demonstrated that semantic categories have fuzzy boundaries, and thus that the number of word meanings per lexical item is to a large extent arbitrary (Tuggy, 1993). Although this still fuels much discussion among linguists and lexicographers, we think that a description can be more or less fine-grained while maintaining accuracy and validity. Moreover, it has been demonstrated that lexical entries in traditional dictionaries overlap, and that different word meanings can be associated with a single example (Erk and McCarthy, 2009), showing that meaning cannot be sliced into separate and exclusive word senses. The same problem arises when it comes to differentiating between arguments and adjuncts. As Manning (2003) puts it: "There are some very clear arguments (normally, subjects and objects), and some very clear adjuncts (of time and 'outer' location), but also a lot of stuff in the middle". A proper representation thus needs to be based on some kind of continuity, and should take into consideration not only the subject and the object, but also prepositional phrases as well as the wider context.

Some applications already address some of the needs of lexicographers in the era of big data, i.e. big corpora in this context. The best-known application is the SketchEngine (Kilgarriff et al., 2014). This tool has already provided invaluable services to lexicographers and linguists. It gives access to a synthetic view of the different usages of words in context. For example, the SketchEngine can give a direct view of all the subjects or complements of a verb, ranked by frequency or sorted according to various parameters. By exploding the representation, this tool provides an interesting view of the lexicon. However, in our opinion, it falls short when it comes to showing the continuous nature of meaning.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
Here we propose a system that combines the advantages of existing tools (a wide-coverage database offering a synthetic view of a large vocabulary) with those of a dynamic representation. We focus on verbs, since these lexical items exhibit the most complex syntactic and semantic behaviors. More specifically, we examine Japanese verbs, as Japanese presents a complex system of case markers that are generally semantically ambiguous.

2 Outline of our approach

When building a verb lexicon, numerous challenges arise, such as the notion of lexical item (how many entries and subentries are necessary to describe the different meanings of a given verb?) and the distinction between arguments and adjuncts (what complements are necessary to describe a particular meaning of a given verb?). Following up on studies in natural language processing and linguistics, we embrace the hypothesis of a continuum between ambiguity and vagueness (Tuggy, 1993), and the hypothesis that there is no clear distinction between arguments and adjuncts (Manning, 2003). Although this approach has been applied and evaluated for Japanese, the theoretical framework used to compute the argumenthood of a complement or to build the hierarchical structure of the lexical entries is partially language-independent.

We assume a list of verbal structures that have been automatically extracted from a large representative corpus. A verbal structure is an occurrence of a verb and its complements (expressed as syntactic dependencies); a complement is an ordered pair of a lexical head and a case marker.

Computing the argumenthood of complements

Following up on previous studies on the distinction between arguments and adjuncts (Manning, 2003; Merlo and Esteve Ferrer, 2006; Fabre and Bourigault, 2008; Abend and Rappoport, 2010), we propose a new measure of the degree of argumenthood of complements, derived from the well-known TF-IDF weighting scheme used in information retrieval:

\[
\mathrm{argumenthood}(v, c) = \bigl(1 + \log \mathrm{count}(v, c)\bigr) \cdot \log \frac{|V|}{\lvert\{v' \in V : \exists (v', c)\}\rvert} \tag{1}
\]

where c is a complement (i.e. an ordered pair of a lexical head and a case particle); v is a verb; count(v, c) is the number of cooccurrences of the complement c with the verb v; |V| is the total number of unique verbs; and |{v′ ∈ V : ∃(v′, c)}| is the number of unique verbs cooccurring with this complement. That is, we are dealing with complements instead of terms, and with verbs instead of documents. This measure captures two important rules of thumb for distinguishing between arguments and adjuncts. The first part of the formula, (1 + log count(v, c)), reflects the idea that complements appearing frequently with a given verb tend to be arguments; the second part, log(|V| / |{v′ ∈ V : ∃(v′, c)}|), reflects the idea that complements appearing with a large variety of verbs tend to be adjuncts. The proposed measure assigns a value between 0 and 1 to a complement (0 corresponds to a prototypical adjunct; 1 corresponds to a prototypical argument) and thus models a continuum between arguments and adjuncts.
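To make the measure concrete, here is a minimal Python sketch of equation (1), assuming the verbal structures are available as (verb, complements) pairs; the function and variable names are ours, not from the authors' implementation. Note that the raw TF-IDF product is not bounded by 1, so the 0-to-1 range reported above presumably involves a normalization step not spelled out in this excerpt.

```python
import math
from collections import defaultdict

def argumenthood_scores(structures):
    """Compute argumenthood(v, c) for every verb/complement pair.

    `structures` is an iterable of (verb, complements) pairs, where each
    complement is a (lexical_head, case_marker) tuple, i.e. the verbal
    structures assumed to be extracted from a parsed corpus.
    """
    count = defaultdict(int)       # (v, c) -> number of cooccurrences
    verbs_with = defaultdict(set)  # c -> set of verbs observed with c
    verbs = set()

    for verb, complements in structures:
        verbs.add(verb)
        for comp in complements:
            count[(verb, comp)] += 1
            verbs_with[comp].add(verb)

    scores = {}
    for (verb, comp), n in count.items():
        tf = 1 + math.log(n)                                # frequent with this verb: argument-like
        idf = math.log(len(verbs) / len(verbs_with[comp]))  # seen with many verbs: adjunct-like
        scores[(verb, comp)] = tf * idf                     # raw score; rescaling assumed for [0, 1]
    return scores
```

Intuitively, a complement that cooccurs almost exclusively with one verb would be expected to score high (argument-like), while a temporal adjunct appearing with most verbs would score near zero.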
Enriching verb description using shallow clustering

A verbal structure corresponds to a specific sense of a given verb; that is, the sense of the verb is given by the complements it selects. Yet a single verbal structure contains a very limited number of complements. To obtain a more complete description of the verb sense, we propose to merge verbal structures corresponding to the same meaning of a given verb into a minimal predicate-frame, using reliable lexical clues. We call this technique shallow clustering. Our method relies on the principles that i) two verbal structures describing the same verb and having at least one common complement might correspond to the same verb sense, and ii) some complements are more informative than others for a given sense. As for the second principle, the measure of argumenthood introduced in the previous section serves as a tool for identifying the complements that contribute most to the verb meaning. Our method merges verbal structures in an iterative process, beginning with the most informative complements (i.e. complements yielding the highest argumenthood values), as shown in Algorithm 1.

Data: a set W of verbal structures (v, D), where v is a verb and D is a list of complements
Result: a set W′ of minimal predicate-frames (v, D′) such that |W′| ≤ |W|

W′ ← ∅
foreach verb v in {v : ∃(v, D) ∈ W}:
    C ← {c : ∃(v, D) ∈ W ∧ c ∈ D}
    /* let C′ be the list of complements cooccurring with v,
       sorted by argumenthood value in non-increasing order */
    C′ ← (c : c ∈ C ∧ argumenthood(v, C′[i]) ≥ argumenthood(v, C′[i+1]))
    for i ← 0 to length(C′) − 1:
        D′ ← ∅    /* D′ is a subset of {D : ∃(v, D) ∈ W} */
        foreach list of complements D in {D : ∃(v, D) ∈ W}:
            if C′[i] ∈ D:
                add D to D′; remove (v, D) from W
        foreach list of complements D in {D : ∃(v, D) ∈ W}:
            if ∃X ∈ D′ such that D ⊂ X:
                add D to D′; remove (v, D) from W
        if |D′| ≥ 2:
            add the minimal predicate-frame (v, D′) to W′

Algorithm 1: Shallow clustering of verbal structures.
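As a companion to the pseudocode, here is a hedged Python transcription of Algorithm 1. Representing a minimal predicate-frame as the union of the merged structures' complements is our reading of "merge"; all names are illustrative.

```python
def shallow_clustering(structures, scores):
    """Merge one verb's verbal structures into minimal predicate-frames,
    following Algorithm 1. `structures` is a list of (verb, complements)
    pairs; `scores` maps (verb, complement) to its argumenthood value.
    """
    frames = []
    for verb in {v for v, _ in structures}:
        # All structures of this verb, each as a set of complements.
        pool = [set(D) for v, D in structures if v == verb]
        # Complements of this verb, most argument-like first (C' above).
        ordered = sorted({c for D in pool for c in D},
                         key=lambda c: scores[(verb, c)], reverse=True)
        for pivot in ordered:
            # i) structures sharing the pivot complement may share a sense.
            cluster = [D for D in pool if pivot in D]
            pool = [D for D in pool if pivot not in D]
            # Structures strictly included in a clustered one join it too.
            absorbed = [D for D in pool if any(D < X for X in cluster)]
            pool = [D for D in pool if not any(D < X for X in cluster)]
            cluster.extend(absorbed)
            if len(cluster) >= 2:
                # A minimal predicate-frame: here, the union of the merged
                # structures' complements (our interpretation of "merge").
                frames.append((verb, set().union(*cluster)))
    return frames
```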

Modeling word senses through hierarchical clustering

We propose to cluster the minimal predicate-frames built during the shallow clustering procedure into a dendrogram structure. A dendrogram allows the definition of an arbitrary number of classes (using a threshold) and thus fits nicely with our goal of modeling a continuum between ambiguity and vagueness. A dendrogram is usually built using a hierarchical clustering algorithm, with a distance matrix as its input. To measure the distance between minimal predicate-frames, we propose to represent them as vectors, which then serve as arguments of a similarity function. Following previous studies on semantic composition, we suppose that "the meaning of a whole is a function of the meaning of the parts and of the way they are syntactically combined" (Partee, 1995), as well as of all the information involved in the composition process (Mitchell, 2011). The following equation summarizes the proposed model of semantic composition:
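The composition equation itself is missing from this excerpt (the text is truncated at this point). Purely as an illustrative assumption consistent with the surrounding description (the parts, their syntactic combination via case markers, and the argumenthood information), the sketch below builds a frame vector as an argumenthood-weighted sum of complement head-word vectors; the weighting scheme and the `embeddings` lookup are hypothetical, not the authors' equation.

```python
import numpy as np

def frame_vector(verb, complements, embeddings, scores):
    """Hypothetical composition: an argumenthood-weighted sum of the
    complements' head-word vectors. An assumed stand-in, not the
    paper's (truncated) equation.
    """
    parts = [scores[(verb, (head, case))] * embeddings[head]
             for head, case in complements]
    return np.sum(parts, axis=0)
```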

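Given one vector per minimal predicate-frame, building the dendrogram and cutting it at an adjustable threshold is standard agglomerative clustering. The sketch below uses SciPy; the cosine distance and average linkage are our choices, as the excerpt does not name them.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def sense_classes(frame_vectors, threshold):
    """Cluster frame vectors (one row per minimal predicate-frame) into a
    dendrogram, then cut it at `threshold` to obtain flat sense classes.
    A low threshold yields many fine-grained senses; a high one yields
    fewer, vaguer senses: the ambiguity/vagueness continuum, adjustable.
    """
    distances = pdist(np.asarray(frame_vectors), metric='cosine')  # condensed distance matrix
    tree = linkage(distances, method='average')                    # agglomerative clustering
    return fcluster(tree, t=threshold, criterion='distance')       # class label per frame
```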